dttools: priority queue data structure #3915
Conversation
I like this implementation, it's nice and straightforward.
A few nits about the API documentation versus comments in the implementation below.
Now, the real question is, can it be worked into our various scheduling methods without breaking everything? Make sure you talk to @colinthomas-z80 about how/why we iterate over only part of the list at one time.
I was thinking this could possibly replace the list rotation and the way we only visit a subset of the tasks on each scheduling attempt. As long as the priority reflects whether a task is ready to run (i.e. its data dependencies are present), we should be able to find a good task without looking too deep into the list.
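One way to realize this (a minimal sketch with illustrative names, not the dttools API) is to fold readiness into the ordering key, so ready tasks always sort ahead of blocked ones regardless of their nominal priority:

```python
import heapq

# Sketch only: encode (readiness, priority) as the heap key so that
# runnable tasks surface first. heapq is a min-heap, so smaller tuples
# pop first; 0 marks "ready" and priorities are negated for highest-first.
def key(task_priority, deps_present):
    return (0 if deps_present else 1, -task_priority)

heap = [
    (key(10, False), "blocked-high"),   # high priority, but deps missing
    (key(3, True), "ready-low"),
    (key(7, True), "ready-high"),
]
heapq.heapify(heap)
print(heapq.heappop(heap)[1])  # -> "ready-high"
```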
When all tasks have the same priority, the queue might look like this (priority, data):
The first time we pop, we get
This is roughly like rotating the list. In contrast, when tasks have different priorities, the element with the highest priority stays at the top of the queue if we repeatedly pop and push it. For example, with this priority queue (the storage order is shown here):
Popping the first element and then deciding to reconsider it later would place it back at the top of the queue, since its priority remains unchanged. However, in real-world scenarios, if an element is popped and set aside for later consideration, it probably shouldn't retain its high priority. Instead, we might decrease its priority by 1 or set it to the lowest priority, allowing other elements deeper in the queue a fair chance to be considered. For example, pop:
push back
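The pop-then-decay idea can be sketched with Python's `heapq` (illustrative only, not the dttools implementation): a popped task that is set aside for later reconsideration is re-pushed with its priority lowered by 1, so peers at the old priority get a turn.

```python
import heapq

# Min-heap of (-priority, task) pairs: negate so highest priority pops first.
heap = []
for prio, task in [(9, "a"), (9, "b"), (5, "c")]:
    heapq.heappush(heap, (-prio, task))

neg, task = heapq.heappop(heap)        # pops "a" (priority 9)
heapq.heappush(heap, (neg + 1, task))  # re-push "a" decayed to priority 8

print(heapq.heappop(heap)[1])  # -> "b": the other priority-9 task goes next
```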
What I'm uncertain about is that using a priority queue instead of a plain list to store ready tasks eliminates the opportunity to schedule tasks with a breadth-first approach (implementing FIFO ordering in a priority queue is comparatively expensive). We could let the users specify which scheduling algorithm they want to use, but that results in two different data structures existing to support both.
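For what it's worth, one common workaround is to pair each entry with a monotonically increasing insertion counter, which gives FIFO order among equal priorities at ordinary heap cost. A Python sketch (illustrative names, not the proposed dttools API):

```python
import heapq
import itertools

class PriorityFifoQueue:
    """Priority queue that breaks ties in insertion (FIFO) order."""
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # monotonically increasing tie-breaker

    def push(self, priority, item):
        # heapq is a min-heap, so negate priority for highest-first.
        heapq.heappush(self._heap, (-priority, next(self._counter), item))

    def pop(self):
        _, _, item = heapq.heappop(self._heap)
        return item

q = PriorityFifoQueue()
for name in ["t1", "t2", "t3"]:
    q.push(0, name)        # all equal priority
q.push(5, "urgent")
print(q.pop())             # -> "urgent" (highest priority first)
print(q.pop())             # -> "t1" (FIFO among equal priorities)
```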
I did a quick experiment with the script provided by Colin:

```python
#! /usr/bin/env python
import argparse
import sys
import time
import math
import ndcctools.taskvine as tv
import random

if __name__ == '__main__':
    q = tv.Manager(port=9123)
    satisfied = False

    # initial task count
    t_count = 5000

    print(f"Submitting {t_count} tasks")
    t_start = time.time()
    for i in range(t_count):
        t = tv.Task(":")
        q.submit(t)
    t_submitted = time.time()
    print(f"Time to submit {t_submitted-t_start}s : {t_count/(t_submitted-t_start)} tasks per second.")

    while not q.empty():
        t = q.wait(5)
    t_end = time.time()
    complete = t_end - t_start
    print(f"Tasks completed in {complete}s : {t_count/(complete)} tasks per second.")
    print()
    quit()
```

As mentioned by @colinthomas-z80, the reason list rotation was introduced is that the manager first scans the task ready list to see whether any tasks are available to be scheduled, and then collects results that are ready on the workers. But even if no workers are available, it still scans the whole ready list looking for another task to schedule. The solution is to only consider a small number of tasks on each attempt, say 100: the scan breaks out once it has examined 100 tasks. The consequence of looking at only 100 tasks, however, is that tasks deeper in the ready list never get a fair chance to be scheduled. So list rotation was introduced to wrap the scan around the whole list as scheduling proceeds.
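The bounded scan plus rotation described above can be sketched like this (hypothetical names and limit, not the actual manager code): each attempt examines at most `SCAN_LIMIT` tasks, and rotating the deque makes the next attempt resume where the previous one stopped.

```python
from collections import deque

SCAN_LIMIT = 100  # hypothetical cap on tasks examined per attempt

def find_schedulable(ready, can_run):
    """Scan at most SCAN_LIMIT tasks; rotation makes the next call resume
    where this one stopped, wrapping around the whole list over time."""
    for _ in range(min(SCAN_LIMIT, len(ready))):
        task = ready[0]
        ready.rotate(-1)          # move head to tail
        if can_run(task):
            ready.remove(task)    # dispatch this task
            return task
    return None                   # nothing schedulable this attempt

ready = deque(range(250))
print(find_schedulable(ready, lambda t: t % 2 == 0))  # -> 0
print(find_schedulable(ready, lambda t: t >= 200))    # -> None (scanned 1..100)
print(find_schedulable(ready, lambda t: t >= 200))    # -> 200 (resumed at 101)
```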
I’m not sure why there is so much variability in the results. Can you see if things change if you run your worker with a finite disk value, such as
Ah, I set finite values for disk and memory, and the variability issue immediately disappears.
Do we want to fully replace the list with a priority queue, or use a heuristic?
Jin, your question is too vague to answer. If you are proposing to replace the list with a priority queue, then you need to show us the changes needed in TaskVine to use the new data structure, and we have to consider carefully whether it meets all of our needs. I don't know what you mean by "a heuristic". Do you have a specific heuristic in mind?
Well, I can quickly propose a temporary PR to show the changes needed to replace the list with a priority queue after this one is merged. I don't know whether it counts as a heuristic approach, but we could support both data structures and let users decide which one they want to use. By default, they would use the linked list, but if they want to give some tasks different priorities, they can tune a parameter like
Proposed Changes
The idea came from the discussion in #3851.
TaskVine currently uses a Breadth First scheduling algorithm, implemented with a FIFO queue: tasks are considered in the order they arrive. Alternatively, Depth First scheduling prioritizes tasks deeper in the compute graph, allowing them to be considered earlier when tasks have different priorities.
Both algorithms perform similarly without storage optimization techniques. However, in a large-scale compute graph, promptly pruning files helps workers reduce peak disk load, and Depth First scheduling makes pruning more effective: because tasks are interdependent, files can be consumed, and then removed, as soon as they are created.
With a small HEP application consisting of nearly 1000 tasks, pruning with Breadth First scheduling (the current scheduling algorithm) reduces peak disk usage by 50.45%, while pruning with Depth First scheduling achieves a reduction of 92.23%.
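To illustrate why depth-first order enables earlier pruning, here is a toy sketch (an assumed graph of two independent task chains, not TaskVine code) where a task's priority is its depth in the compute graph. The first chain completes entirely before the second begins, so its intermediate files can be pruned right away instead of lingering while unrelated tasks run.

```python
import heapq

# Toy dependency graph: chain t0 -> t1 -> t2 and chain u0 -> u1 -> u2.
deps = {"t1": ["t0"], "t2": ["t1"], "u1": ["u0"], "u2": ["u1"],
        "t0": [], "u0": []}

depth = {}
def d(t):
    """Depth of task t in the graph (1 for roots)."""
    if t not in depth:
        depth[t] = 1 + max((d(p) for p in deps[t]), default=0)
    return depth[t]

# Depth-first flavor: always pop the deepest *ready* task.
done, order = set(), []
heap = [(-d(t), t) for t in deps if not deps[t]]
heapq.heapify(heap)
while heap:
    _, t = heapq.heappop(heap)
    done.add(t)
    order.append(t)
    for c in deps:  # enqueue consumers whose dependencies are now satisfied
        if t in deps[c] and all(p in done for p in deps[c]):
            heapq.heappush(heap, (-d(c), c))

print(order)  # -> ['t0', 't1', 't2', 'u0', 'u1', 'u2']
```

A breadth-first order would interleave the two chains (`t0, u0, t1, u1, ...`), keeping both chains' intermediate files alive at once.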
Merge Checklist
The following items must be completed before PRs can be merged.
Check these off to verify you have completed all steps.
- `make test`: Run local tests prior to pushing.
- `make format`: Format source code to comply with lint policies. Note that some lint errors can only be resolved manually (e.g., Python).
- `make lint`: Run lint on source code prior to pushing.